
Attempt to optimize LocalExecutor #144


Open

james7132 wants to merge 22 commits into master from optimized-local-executor

Conversation

@james7132 (Contributor) commented Jul 24, 2025

This PR makes LocalExecutor its own separate implementation instead of a wrapper around Executor, forking its own State type that uses unsynchronized !Send types where possible: Arc -> Rc, Mutex -> RefCell, AtomicPtr -> Cell<*mut T>, ConcurrentQueue -> VecDeque. This implementation also removes the extra operations that assume there are other concurrent Runners/Tickers (i.e. local queues, extra notifications).
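As a rough illustration of that substitution, here is a minimal sketch with assumed field names (the real State in this PR holds more, e.g. waker bookkeeping):

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;
use std::rc::Rc;
use std::task::Waker;

use async_task::Runnable;
use slab::Slab;

// !Send by construction: Rc, RefCell, and Cell all make this type !Sync and
// !Send, so the compiler rules out cross-thread use instead of a Mutex.
struct State {
    queue: RefCell<VecDeque<Runnable>>, // was ConcurrentQueue<Runnable>
    active: RefCell<Slab<Waker>>,       // was Mutex<Slab<Waker>>
    notified: Cell<bool>,               // was an atomic flag
}

type LocalState = Rc<State>; // was Arc<State>
```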

For testing, I've duplicated most of the single-thread-compatible Executor tests to ensure there's equivalent coverage for the new independent LocalExecutor.

I've also made an attempt to write this with UnsafeCell instead of RefCell, but it litters the code with a huge amount of unsafe, which might be too much for this crate. The gains there might not be worth it.

I previously wrote some additional benchmarks but lost the changes to a stray git reset --hard while benchmarking.
The gains here are substantial. Here are the results:

single_thread/local_executor::spawn_one
                        time:   [130.05 ns 130.98 ns 132.16 ns]
                        change: [-80.586% -80.413% -80.214%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
single_thread/local_executor::spawn_batch
                        time:   [17.195 µs 22.167 µs 32.281 µs]
                        change: [-30.007% +0.4082% +41.137%] (p = 0.98 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe
Benchmarking single_thread/local_executor::spawn_many_local: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 856.5s, or reduce sample count to 10.
single_thread/local_executor::spawn_many_local
                        time:   [2.7766 ms 2.8091 ms 2.8453 ms]
                        change: [-21.653% -8.5454% +6.4249%] (p = 0.30 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
single_thread/local_executor::spawn_recursively
                        time:   [19.618 ms 20.071 ms 20.558 ms]
                        change: [-24.237% -22.461% -20.762%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
Benchmarking single_thread/local_executor::yield_now: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s, enable flat sampling, or reduce sample count to 50.
single_thread/local_executor::yield_now
                        time:   [1.8382 ms 1.8421 ms 1.8470 ms]
                        change: [-52.888% -52.744% -52.570%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
single_thread/local_executor::channels
                        time:   [9.9259 ms 9.9355 ms 9.9452 ms]
                        change: [-15.797% -15.656% -15.533%] (p = 0.00 < 0.05)
                        Performance has improved.
single_thread/local_executor::web_server
                        time:   [54.839 µs 56.776 µs 59.062 µs]
                        change: [-23.926% -18.725% -13.009%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

@james7132 (Contributor, Author)

Hmmm, VecDeque::new is not const until Rust 1.68. It could be trivially replaced with LinkedList which has a const constructor in 1.63, but it would be a major regression in performance.

@taiki-e (Collaborator) commented Jul 25, 2025

There is no need to worry about the MSRV of const VecDeque::new; we will be able to raise the MSRV in about two weeks: smol-rs/smol#244 (comment)

@james7132 (Contributor, Author)

OK, I did a more comprehensive benchmark run, and it's pretty clear that the UnsafeCell version is tangibly faster:

group                                                     local-executor-opt                      local-executor-opt-w-unsafe-cell        master
-----                                                     ------------------                      --------------------------------        ------
single_thread/local_executor::channels                    1.18     12.2±0.69ms        ? ?/sec     1.00     10.3±0.08ms        ? ?/sec     1.41     14.6±0.93ms        ? ?/sec
single_thread/local_executor::spawn_many_local            1.11      2.7±0.20ms        ? ?/sec     1.00      2.4±0.15ms        ? ?/sec     1.25      3.0±0.20ms        ? ?/sec
single_thread/local_executor::spawn_one                   1.15    155.7±7.89ns        ? ?/sec     1.00   135.0±14.68ns        ? ?/sec     6.08   821.0±86.13ns        ? ?/sec
single_thread/local_executor::spawn_recursively           1.01     27.4±2.17ms        ? ?/sec     1.00     27.1±2.79ms        ? ?/sec     1.17     31.7±2.56ms        ? ?/sec
single_thread/local_executor::web_server                  1.25     20.2±1.63ms        ? ?/sec     1.00     16.2±0.10ms        ? ?/sec     1.60     25.9±1.69ms        ? ?/sec
single_thread/local_executor::yield_now                   1.18      2.3±0.12ms        ? ?/sec     1.00  1981.4±166.43µs        ? ?/sec    2.35      4.7±0.21ms        ? ?/sec
single_thread/static_local_executor::channels             1.38     13.8±1.80ms        ? ?/sec     1.00     10.0±0.13ms        ? ?/sec     1.63     16.3±1.70ms        ? ?/sec
single_thread/static_local_executor::spawn_many_local     1.20  1770.7±119.07µs        ? ?/sec    1.00  1474.5±28.53µs        ? ?/sec     1.70      2.5±0.11ms        ? ?/sec
single_thread/static_local_executor::spawn_one            1.49    154.8±8.22ns        ? ?/sec     1.00    104.0±6.94ns        ? ?/sec     7.28   756.9±66.28ns        ? ?/sec
single_thread/static_local_executor::spawn_recursively    1.53     26.4±2.18ms        ? ?/sec     1.00     17.3±1.05ms        ? ?/sec     1.61     27.8±3.10ms        ? ?/sec
single_thread/static_local_executor::web_server           1.19     20.0±1.05ms        ? ?/sec     1.00     16.7±0.32ms        ? ?/sec     1.71     28.6±3.43ms        ? ?/sec
single_thread/static_local_executor::yield_now            1.29      2.5±0.22ms        ? ?/sec     1.00  1916.6±165.43µs        ? ?/sec    2.54      4.9±0.19ms        ? ?/sec
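The shape of the RefCell -> UnsafeCell swap, as a hedged sketch (illustrative names, and the actual change covers more state than just the queue): each borrow-flag check is replaced by a raw dereference whose safety argument is that the executor is single-threaded and never re-enters these methods.

```rust
use std::cell::UnsafeCell;
use std::collections::VecDeque;

use async_task::Runnable;

struct Queue {
    inner: UnsafeCell<VecDeque<Runnable>>,
}

impl Queue {
    fn push(&self, runnable: Runnable) {
        // SAFETY: the executor is !Send and this is never called reentrantly,
        // so this is the only live reference to the VecDeque.
        unsafe { (*self.inner.get()).push_back(runnable) }
    }

    fn pop(&self) -> Option<Runnable> {
        // SAFETY: as above; the reference ends before any user code runs.
        unsafe { (*self.inner.get()).pop_front() }
    }
}
```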

This PR should be ready for review now.

@james7132 james7132 marked this pull request as ready for review August 9, 2025 05:11
@james7132 james7132 requested review from notgull and taiki-e August 9, 2025 05:26
@james7132 (Contributor, Author) commented Aug 9, 2025

I don't think we can reasonably make LocalExecutor any faster with its current interface. Anything more would probably require std LocalWaker support, a customized LocalTask<T> in async_task, or stack allocation of some kind.

@james7132 force-pushed the optimized-local-executor branch from d206459 to 13fbe6f on August 9, 2025 06:01
@james7132 (Contributor, Author)

Spoke too soon: I added an early-out to LocalExecutor::run for when the queue is empty and nabbed some additional performance improvements:

group                                                     local-executor-opt-w-unsafe-cell        local-executor-opt-w-unsafe-cell-and-early-out-run    master
-----                                                     --------------------------------        --------------------------------------------------    ------
single_thread/local_executor::channels                    1.04     10.3±0.08ms        ? ?/sec     1.00      9.9±0.15ms        ? ?/sec                   1.47     14.6±0.93ms        ? ?/sec
single_thread/local_executor::spawn_many_local            1.42      2.4±0.15ms        ? ?/sec     1.00  1703.3±37.62µs        ? ?/sec                   1.78      3.0±0.20ms        ? ?/sec
single_thread/local_executor::spawn_one                   1.44   135.0±14.68ns        ? ?/sec     1.00     94.1±2.81ns        ? ?/sec                   8.73   821.0±86.13ns        ? ?/sec
single_thread/local_executor::spawn_recursively           1.58     27.1±2.79ms        ? ?/sec     1.00     17.2±0.84ms        ? ?/sec                   1.85     31.7±2.56ms        ? ?/sec
single_thread/local_executor::web_server                  1.00     16.2±0.10ms        ? ?/sec     1.00     16.1±0.14ms        ? ?/sec                   1.60     25.9±1.69ms        ? ?/sec
single_thread/local_executor::yield_now                   1.19  1981.4±166.43µs        ? ?/sec    1.00  1670.0±18.72µs        ? ?/sec                   2.79      4.7±0.21ms        ? ?/sec
single_thread/static_local_executor::channels             1.03     10.0±0.13ms        ? ?/sec     1.00      9.8±0.06ms        ? ?/sec                   1.67     16.3±1.70ms        ? ?/sec
single_thread/static_local_executor::spawn_many_local     1.11  1474.5±28.53µs        ? ?/sec     1.00  1323.9±83.38µs        ? ?/sec                   1.90      2.5±0.11ms        ? ?/sec
single_thread/static_local_executor::spawn_one            1.38    104.0±6.94ns        ? ?/sec     1.00     75.2±1.06ns        ? ?/sec                   10.07  756.9±66.28ns        ? ?/sec
single_thread/static_local_executor::spawn_recursively    1.20     17.3±1.05ms        ? ?/sec     1.00     14.4±0.85ms        ? ?/sec                   1.93     27.8±3.10ms        ? ?/sec
single_thread/static_local_executor::web_server           1.03     16.7±0.32ms        ? ?/sec     1.00     16.3±0.10ms        ? ?/sec                   1.76     28.6±3.43ms        ? ?/sec
single_thread/static_local_executor::yield_now            1.25  1916.6±165.43µs        ? ?/sec    1.00  1530.8±30.15µs        ? ?/sec                   3.18      4.9±0.19ms        ? ?/sec
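A hedged sketch of the early-out shape (not the PR's exact code): the tick path reports an empty queue immediately so run can go straight back to polling the main future instead of setting up the waiting machinery.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

use async_task::Runnable;

struct State {
    queue: RefCell<VecDeque<Runnable>>,
}

impl State {
    fn tick(&self) -> bool {
        // The borrow is released at the end of this statement, so the task
        // may safely re-enter the executor (e.g. to spawn) when it runs.
        let runnable = self.queue.borrow_mut().pop_front();
        match runnable {
            None => false, // early-out: nothing scheduled
            Some(r) => {
                r.run();
                true
            }
        }
    }
}
```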

@notgull (Member) commented Aug 13, 2025

@james7132 Can you rebase on master?

@james7132 force-pushed the optimized-local-executor branch from 13fbe6f to c4f077f on August 13, 2025 09:34
@james7132 (Contributor, Author)

> @james7132 Can you rebase on master?

Done!

#[inline]
pub(crate) fn state_ptr(&self) -> *const State {
    #[cold]
    fn alloc_state(cell: &Cell<*mut State>) -> *mut State {
Member

Why do we delay the allocation of the state? How beneficial is this from a performance perspective, given that it trades the chance to optimize away the state allocation entirely (in case the LocalExecutor gets destroyed without use) for a branch on every state access (minus the few where the compiler can infer that the state is non-null / initialized)?

@james7132 (Contributor, Author) commented Aug 13, 2025

It allows LocalExecutor::new to be const; without it, this would be a breaking change and would require a major version bump.

Executor has this as well for similar reasons, using an AtomicPtr instead of a Cell.
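To make the tradeoff concrete, a minimal sketch consistent with the hunk above (the real State fields and the Drop impl that frees the allocation are omitted):

```rust
use std::cell::Cell;
use std::ptr;

struct State {} // stand-in; the real State holds the queue and active tasks

pub struct LocalExecutor {
    state: Cell<*mut State>,
}

impl LocalExecutor {
    // Cell::new(ptr::null_mut()) works in const context, so `new` stays
    // const even though State itself cannot be built in one.
    pub const fn new() -> LocalExecutor {
        LocalExecutor {
            state: Cell::new(ptr::null_mut()),
        }
    }

    pub(crate) fn state_ptr(&self) -> *const State {
        let mut p = self.state.get();
        if p.is_null() {
            p = alloc_state(&self.state); // cold path, taken at most once
        }
        p
    }
}

#[cold]
fn alloc_state(cell: &Cell<*mut State>) -> *mut State {
    // First use: allocate the state and stash the raw pointer in the cell.
    let p = Box::into_raw(Box::new(State {}));
    cell.set(p);
    p
}
```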

@james7132 (Contributor, Author)

That said, this comment makes me realize that the State in both executors satisfies all the safety invariants needed to make the state a Pin<Box<State>> instead of the reference-counted types, so we can avoid the reference counting even without the 'static borrows of the Static variants.

As far as I know this holds true even for the instances of either that get leaked into their Static variants.
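A hedged sketch of that idea (names illustrative): the executor owns the state in a Pin<Box<_>> and hands out Pin<&State>, which is sound only because the executor, or the leaked Static variant, strictly outlives every task borrowing it.

```rust
use std::pin::Pin;

struct State {} // stand-in for the executor state

struct LocalExecutor {
    // Owned and pinned: the State never moves, so tasks can hold
    // Pin<&State> instead of cloning an Rc.
    state: Pin<Box<State>>,
}

impl LocalExecutor {
    fn state(&self) -> Pin<&State> {
        self.state.as_ref()
    }
}
```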

Member

As far as I'm aware, this sounds correct.

Member

Okay, for the next major version bump it would be nice to pursue a version which doesn't use the internal Cell wrapping, at least for the LocalExecutor version, where "difficulty of sharing across threads" isn't that much of a concern.

@james7132 (Contributor, Author)

Went ahead and implemented the pin change, and filed #146 for the non-local version.

@james7132 force-pushed the optimized-local-executor branch from 2b5adca to d0a1f9e on August 14, 2025 00:01
fn spawn_inner<T: 'a>(
    state: Pin<&'a State>,
    future: impl Future<Output = T> + 'a,
    active: &mut Slab<Waker>,
@fogti (Member) commented Aug 14, 2025

This function has one aspect which bothers me:

It gets passed an active: &mut Slab<Waker>, but in its body it also constructs additional unsafe { &mut *state.active.get() } references, violating the "exclusive / strictly nested" property of &mut references. This is undefined behavior both because it violates Rust's memory aliasing model and because rustc annotates &mut arguments as noalias, which can cause LLVM to deduce wrong properties about the function.

The place where such an additional bad reference is constructed is the closure for future below, which would be invoked, e.g., in the case of a panic in the function.
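Reduced to a sketch (assumed names; the real code derives the second reference inside the future's cleanup closure):

```rust
use std::cell::UnsafeCell;
use std::task::Waker;

use slab::Slab;

struct State {
    active: UnsafeCell<Slab<Waker>>,
}

fn spawn_inner(state: &State, active: &mut Slab<Waker>) {
    let cleanup = move || {
        // A second &mut to the same Slab, derived while `active` may still
        // be live in the caller: UB under Rust's aliasing rules, and rustc
        // marks &mut arguments noalias for LLVM.
        let active2 = unsafe { &mut *state.active.get() };
        let _ = active2;
    };
    // If `cleanup` can run (e.g. during unwinding) while `active` is still
    // borrowed, the two exclusive references overlap.
    let _ = (active, cleanup);
}
```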

Member

I know this problem also exists for the normal Executor::spawn_inner, but in this case it was more plainly visible.

// Remove the task from the set of active tasks when the future finishes.
let entry = active.vacant_entry();
let index = entry.key();
let future = AsyncCallOnDrop::new(future, move || {
Member

It would be nice if Builder::new().propagate_panic(true) could be saved in a binding above this line.

It is not really necessary, but it makes it obvious that no panics can occur between the construction of AsyncCallOnDrop and the call to .spawn_unchecked, at which point the handling of drop unwinding gets taken over by Task, which automatically schedules itself in that case and notifies this LocalExecutor.

Note that entry.insert might allocate, and subsequently panic on OOM.
I think the easiest way to prevent panics in this function in a sort-of "critical section" w.r.t. state.active would be to invoke active.reserve(1); at the top of the function.

@james7132 (Contributor, Author) commented Aug 14, 2025

Really good catch! Unwinds continue to be the one control-flow path that's easy to miss.

I went ahead and moved the builder outside of the "critical section".

reserve addresses the allocator OOM from the Slab, but an OOM may also occur in schedule with the VecDeque, and reserve adds extra work on the hot path. I'd rather use a more robust / zero-cost method, so I opted to put an AbortOnPanic guard around the critical section instead. Rust's drop order should force it to run before the AsyncCallOnDrop is dropped.
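A minimal sketch of such a guard, assuming a hand-rolled AbortOnPanic (the PR may define it differently): while the guard is live, any unwind aborts the process instead of running Drop code that could observe half-updated executor state.

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        // Only reached during an unwind, because the happy path below
        // forgets the guard.
        std::process::abort();
    }
}

fn critical_section() {
    let guard = AbortOnPanic;
    // ... insert into the active Slab, push to the queue, etc. ...
    std::mem::forget(guard); // disarm: no panic occurred
}
```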

Co-authored-by: Ellen Emilia Anna Zscheile <[email protected]>
@fogti (Member) left a comment

Overall LGTM, but as this is a complicated change, it should still be reviewed by others.

cc @smol-rs/admins
